On Measuring Localization of Shortcuts in Deep Networks

Tsoy, Nikita, Konstantinov, Nikola

arXiv.org Machine Learning

Shortcuts, spurious rules that perform well during training but fail to generalize, present a major challenge to the reliability of deep networks (Geirhos et al., 2020). However, the impact of shortcuts on feature representations remains understudied, obstructing the design of principled shortcut-mitigation methods. To overcome this limitation, we investigate the layer-wise localization of shortcuts in deep models. Our novel experiment design quantifies each layer's contribution to the accuracy degradation caused by a shortcut-inducing skew via counterfactual training on clean and skewed datasets. We employ our design to study shortcuts on the CIFAR-10, Waterbirds, and CelebA datasets across VGG, ResNet, DeiT, and ConvNeXt architectures. We find that shortcut learning is not localized in specific layers but distributed throughout the network. Different network parts play different roles in this process: shallow layers predominantly encode spurious features, while deeper layers predominantly forget core features that are predictive on clean data. We also analyze how localization differs across settings and describe its principal axes of variation. Finally, our analysis of layer-wise shortcut-mitigation strategies suggests that designing general methods is hard, supporting dataset- and architecture-specific approaches instead.
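To make the counterfactual design described above concrete, below is a minimal sketch of one way such a layer-wise probe could be implemented, assuming it amounts to swapping individual layers' weights between a clean-trained and a skew-trained copy of the same network and measuring the resulting change in clean-test accuracy. The function name, the eval_fn helper, and the exact swapping protocol are illustrative assumptions, not the authors' implementation.

    # Illustrative sketch (not the paper's code): attribute shortcut-induced
    # accuracy degradation to individual layers by swapping per-layer weights
    # from a skew-trained model into a clean-trained copy of the same network.
    import copy

    def layerwise_shortcut_probe(model_clean, model_skewed, eval_fn):
        # eval_fn(model) -> accuracy on a clean held-out set (assumed helper)
        base_acc = eval_fn(model_clean)
        skewed_state = model_skewed.state_dict()
        degradation = {}
        for name, _ in model_clean.named_parameters():
            probe = copy.deepcopy(model_clean)
            state = probe.state_dict()
            state[name] = skewed_state[name].clone()  # counterfactual swap of one tensor
            probe.load_state_dict(state)
            degradation[name] = base_acc - eval_fn(probe)  # drop attributed to this layer
        return degradation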



Neural Information Processing Systems

We would like to thank the reviewers for their valuable and constructive feedback. AdaReg does not explicitly enforce the weight matrices to be positively or negatively correlated; therefore, our method is orthogonal to, and not in conflict with, Dropout. Inspired by this result, we explored hyperparameter learning via empirical Bayes. Regarding BatchNorm, we do observe that a smaller batch size leads to better generalization.





On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

Malladi, Sadhika, Lyu, Kaifeng, Panigrahi, Abhishek, Arora, Sanjeev

arXiv.org Artificial Intelligence

Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing the batch size, and its empirical validation in deep learning settings.
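As a concrete illustration of how such a rule might be applied when scaling up the batch size, the sketch below rescales Adam-style hyperparameters under the common statement of the square root rule: multiplying the batch size by kappa multiplies the learning rate by sqrt(kappa). The accompanying adjustments to beta1, beta2, and eps follow one reading of the paper and should be checked against it before use.

    import math

    def square_root_scaling(lr, beta1, beta2, eps, kappa):
        # Sketch of the square root scaling rule when the batch size is
        # multiplied by kappa; the beta/eps factors are an assumption to verify.
        return {
            "lr": lr * math.sqrt(kappa),            # step size grows with sqrt(kappa)
            "beta1": 1.0 - kappa * (1.0 - beta1),   # shorten the first-moment horizon
            "beta2": 1.0 - kappa * (1.0 - beta2),   # shorten the second-moment horizon
            "eps": eps / math.sqrt(kappa),
        }

    # Example: scaling the batch size from 512 to 2048 (kappa = 4)
    print(square_root_scaling(lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8, kappa=4))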


Lessons for Improving Training Performance -- Part 1

#artificialintelligence

Nine months ago, as part of a joint reference architecture launch with Nvidia, Pure Storage published TensorFlow deep learning performance results. The goal of creating a joint architecture with Nvidia was to identify and solve performance bottlenecks present in an end-to-end deep learning environment -- especially at scale. During creation of our reference architecture, my team identified and resolved performance issues across storage, networking, and compute. Our system is a physical entity, and everything from cabling configuration and MTU size to the TensorFlow prefetch buffer size can impact performance. (Figure: the software and hardware stack in our test environment.)
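Since the prefetch buffer is called out as one of the software knobs that affected throughput, here is a minimal tf.data sketch of where that setting lives; the file pattern, shuffle buffer, and batch size are placeholders rather than the values from the reference architecture.

    import tensorflow as tf

    # Placeholder input pipeline; paths and sizes are illustrative only.
    files = tf.data.Dataset.list_files("/datasets/train-*.tfrecord")
    dataset = (
        tf.data.TFRecordDataset(files, num_parallel_reads=tf.data.AUTOTUNE)
        .shuffle(10_000)
        .batch(256)
        # Overlap input preprocessing with accelerator compute; AUTOTUNE lets
        # the runtime size the prefetch buffer instead of hand-tuning it.
        .prefetch(tf.data.AUTOTUNE)
    )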